Sequence Alignments
Introduction

Sequence alignments can be compiled from DNA, RNA, or amino acid sequences. There are several types, including pairwise, multiple, and structural alignments. Although structure-based alignments are very helpful and may be less biased, they are not covered in this wiki because they require a crystal structure and start to stray from genetics.

A pairwise alignment aligns two sequences, while a multiple sequence alignment aligns three or more. Each is helpful in different cases. Pairwise alignments are more efficient because less data is compared at a time. This makes them well suited to searching databases such as BLAST and Gene (LinkOut) for homologous genes. From the results of these searches, a multiple sequence alignment can be generated. The multiple alignment contains more information and shows sequence conservation across multiple organisms.

Sequences can also be aligned locally (small regions) or globally (entire length). Depending on how conserved the sequences are, one approach may be better than the other. It is very difficult to align unrelated sequences in a meaningful way.

How they work

Pairwise alignments

There are many different programs one can use to generate an alignment. Each program may use a slightly different algorithm; you may find one that you like better, or your group may already have a favorite. Pairwise alignment uses comparison matrices. The algorithm scores identities, mismatches, and gaps. When aligning amino acids the scoring becomes more complex, but that is not the focus of this review.

A gap represents both an insertion and a deletion: it is an insertion in one sequence and a deletion in the other. Allowing gaps in an alignment also increases its complexity. Gaps are necessary but should be kept to a minimum. To help control the number of gaps, the scoring can be changed to make introducing a gap unfavorable. The cost of introducing a gap must then be outweighed by the sequence similarity gained downstream.
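To make the scoring of identities, mismatches, and gaps concrete, here is a minimal sketch of a global pairwise alignment built with a simple dynamic-programming table. The function name `global_align` and the default scores (match +1, mismatch -1, gap -2) are illustrative assumptions, not the settings of any particular program; real tools typically use substitution matrices and separate gap-opening and gap-extension penalties.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align sequences a and b with a linear gap penalty.

    Returns (score, aligned_a, aligned_b). The scoring values are
    illustrative assumptions, not defaults of any real program.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # The first row and column are alignments against nothing: all gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Each cell holds the best score for aligning a[:i] with b[:j].
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # identity or mismatch
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    total, top, bottom = global_align("GATTACA", "GATACA")
    print(total)
    print(top)
    print(bottom)
```

Making `gap` more negative raises the cost of each gap, so the alignment only accepts a gap when the matches gained downstream outweigh that cost.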
Programs can also construct dot plots, where a dot is drawn everywhere the two sequences match. The program can be set to require short regions of sequence similarity before a dot is drawn, which reduces visual noise. Regions of sequence identity then appear as diagonal lines.

Multiple sequence alignments

This method expands on the comparison matrix used for pairwise alignments. Again, different programs use different algorithms. Most programs first run pairwise comparisons to find the two most similar sequences. This comparison usually uses a less complex matrix so as not to take up too much computing power. From there, the program continues filling out the alignment, scoring each addition. The servers are constantly being improved to make this process quicker. For a multiple sequence alignment it is important to make sure the sequences are at least somewhat homologous.

Global alignments use the Needleman-Wunsch algorithm, which Clustal has modified to run multiple sequence alignments. Clustal is one of the most common servers for multiple alignments. It is available as an online web server or as a package that can be downloaded to run on Unix, Windows, or Mac OS. Quick multiple sequence alignments can also be done using BLAST, UniProt, ExPASy, etc. These programs can also do pairwise alignments.

Local alignments use the Smith-Waterman algorithm, which finds the most similar region between the sequences. A local alignment is best suited to looking for motifs or conserved folds between different proteins.

For both types, it is more difficult to align nucleic acid sequences than amino acid sequences. Because of the wobble base in a codon, multiple codons can yield the same amino acid. Therefore there can be divergence between species in the nucleic acid sequence but not in the amino acid sequence.

Finally, some human intervention is needed before calling an alignment done. It is important that the alignment is checked and appropriate changes are made.
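The local alignment idea described above can be sketched in a few lines in the style of the Smith-Waterman algorithm: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The function name `local_align` and the scoring values (match +2, mismatch -1, gap -2) are assumptions chosen for illustration, not the parameters of any specific server.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-2):
    """Find one best-scoring local alignment between a and b.

    Returns (score, aligned_a, aligned_b). Scoring values are
    illustrative assumptions only.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets an alignment restart anywhere, which is what
            # makes the result local rather than global.
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score falls to zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    # Two otherwise dissimilar sequences sharing the short motif ACGT.
    print(local_align("TTTTACGTGGGG", "CCCCACGTAAAA"))
```

In this toy input the flanking regions share nothing, so the reported region is just the shared motif; a global alignment of the same two sequences would be forced to pair the unrelated flanks as well.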
Uses

Sequence alignments have become necessary for bioinformatics. With the growing amount of genomic data available, there need to be efficient ways of analyzing it. The data also need to be aligned reliably, to ensure that a variant found is real and not an artifact of a poor alignment. Alignments are also a great tool for analyzing the conservation of genes from humans to other mammals to simple eukaryotes to prokaryotes. Conservation is important to researchers in many fields, from evolutionists to cell biologists to structural biologists. Certain programs, such as Clustal, can even make a phylogenetic tree.

References

Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981;147(1):195-7. PMID: 7265238.
Fundamentals of Sequence Analysis, 1998-1999.